'.\" t
.TH "clhbd" "1M" "Jun 20, 2006" "1\&.2\&.0"
.SH NAME
clhbd \- Linuxha.net Cluster Heartbeat Daemon

.SH SYNOPSIS
.TS
l l.
clhbd	[\fB--detach\fP] [\fB--verbose\fP] [\fB--file\fP \fBFile\fP] [\fB--config\fP \fIfile\fP]
.TE

.SH DESCRIPTION
\fIclhbd(1M)\fP is the daemon that has two responsibilities in a Linuxha.net
cluster:

.TP
.B *
Listening for 'echo' requests from the other node in the cluster.
.TP
.B *
Sending out 'echo' requests to the other node in the cluster.

.RE
To ensure this is a performant operation two separate processes are
involved - one for each of the tasks specified above. The frequency of the
network requests is determined by the 'warn' and 'dead' times configured
as part of the cluster. If running at the smallest recommended values
(1 second for 'warn' and 2 seconds for 'dead'), it will send out 2 requests
a second.

The requests are sent over \fBall\fP configured networks in the
cluster topology. This ensures that loss of a single network connection
(whether temporary whilst an IP fail-over occurs to another card, or
permanent if a card fails without an available alternative) does not
cause problems when multiple networks are configured.

Although all traffic is encrypted to the same degree as all other
network traffic used in Linuxha.net, the protocol is lightweight since it
is unidirectional and should not lead to a noticeable CPU or network load.

.SH FAILURE RESPONSES
The whole purpose of the daemon is to spot when the other node in the cluster
appears to have failed. It does this by updating the 'last successful
response time' for any packet received from the remote node.

As long as at least one packet from at least one network arrives more frequently
than the configured 'warn' time no problems result. Even if a complete network
fails this daemon takes no action, provided it still receives packets from the
other configured networks.

When no packets from any configured network arrive in the 'warn' time period
a warning will be logged in the configured log file. This event has no effect
on availability; it is designed to ensure that administrators consider
the network load to ensure the timings configured are suitable for the
hardware in question.
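
The timing rules above can be sketched as follows. This is an illustration
only and is not taken from the daemon itself; the function name, its
arguments and the returned strings are hypothetical, and all times are
assumed to be in seconds.
.nf
```python
def heartbeat_state(last_response, now, warn, dead):
    """Classify the silence since the last packet from the remote node.

    All names here are hypothetical; times are in seconds.
    """
    silence = now - last_response
    if silence >= dead:
        return "dead"   # no packet within the dead time: further checks follow
    if silence >= warn:
        return "warn"   # a warning is logged; availability is unaffected
    return "ok"         # at least one network delivered a packet in time
```
.fi
With the smallest recommended values (warn of 1 second, dead of 2 seconds) a
silence of 1.2 seconds would be classified as a warning by this sketch.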

However if a daemon gets no responses in the period of time defined as the
\&'dead' time for the cluster it attempts to gain an SSH connection to the
remote node - and only when that fails will it consider the remote node
to be down.

Yet that is not all that happens; it will then consult the list (if configured)
of well known external hostnames or IP addresses and attempt to ping them.
If the number of successful pings is larger than a configured threshold
then the remote node really is dead and it will send a message to the
main cluster daemon, which will attempt to start up the relevant
applications locally.

If the well known IP addresses (if configured) also fail to respond then this
daemon simply indicates to the cluster daemon that it should consider itself
\&'PARTITIONED' - it is no longer part of the network and should not attempt
to do anything apart from wait for a network reconnection.
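
The decision sequence described above can be summarised in a short sketch.
It is not the daemon's actual code: the function name, its arguments and the
REMOTE_UP and REMOTE_DEAD labels are hypothetical, and only PARTITIONED is a
term used by the cluster. Treating every result at or below the threshold as
PARTITIONED is an assumption, as is the handling of an empty list; the case
where no list is configured is not covered.
.nf
```python
def decide(ssh_ok, ping_results, threshold):
    """Pick an action once the dead time has passed with no packets.

    ssh_ok       -- True if an SSH connection to the remote node succeeded
    ping_results -- list of booleans, one per well known external address
    threshold    -- configured value that the success count must exceed
    All names are hypothetical and used for illustration only.
    """
    if ssh_ok:
        return "REMOTE_UP"      # remote node is alive: log it, no fail-over
    if sum(ping_results) > threshold:
        return "REMOTE_DEAD"    # tell the cluster daemon to start applications
    return "PARTITIONED"        # assumed: local node is cut off, so just wait
```
.fi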

Notice that the daemon will make an attempt to contact the remote node by
SSH via the hostname in question. This is to ensure that if the remote heartbeat
daemon dies it is noticed and logged and no fail-over occurs. It also
allows daemons to be stopped and started without downtime. In
this situation the heartbeat will wake up again once packets are received
from the remote machine.

.SH OPTIONS
.TP
.B \-A,--detach
If not specified the daemon will run as a foreground process. In 99% of all
cases this argument is used since only then will it run like a true daemon.
.TP
.B \-N,--verbose
Run in verbose mode - in this instance the log file will be populated when
key events occur. It does not generate lots of output and so is
recommended (and is always used when the cluster is started via the
\fIclform(1M)\fP utility).
.TP
.B \--nochecksums
Normally if the cluster or application configuration files do not match the
expected checksums the command will abort. Specification of this argument
will override this behaviour.
.TP
.B \--config
Override the default cluster configuration file using the file specified. This
is essentially intended for developers only.

.SH CAVEATS
When heartbeats are not received for the 'dead' period and the daemon checks
for the presence of the other node, the SSH session is given 3 seconds to time out.
It is not currently possible to tune this timeout period.

.SH EXIT CODES
If the program exits then an error has occurred - this is a daemon process
that should continue running until told to do otherwise.

.SH SEE ALSO
.TS
l l.
cldaemon(1M)	- Main Cluster Daemon
clnetd(1M)	- Cluster Network Daemon
.TE

.SH AUTHOR
Simon Edwards, simon.edwards@linuxha.net - http://www.linuxha.net
